ML System Monitoring and Continual Learning
Causes of ML System Failures
Software system failures
- dependency failure
- deployment failure
- hardware failure
- downtime/crashing
ML-Specific Failures
- data distribution shifts: Machine Learning All-in-one#^8a1aad
- edge cases
Edge cases vs. Outliers (from *Designing Machine Learning Systems*)
- Outliers refer to data: an example that differs significantly from other examples. Edge cases refer to performance: an example where a model performs significantly worse than other examples.
- An outlier can cause a model to perform unusually poorly, which makes it an edge case. However, not all outliers are edge cases. For example, a person jaywalking on a highway is an outlier, but it’s not an edge case if your self-driving car can accurately detect that person and decide on a motion response appropriately.
- degenerate feedback loops
- created when a system’s outputs are used to generate the system’s future inputs, which, in turn, influence the system’s future outputs.
- especially common in tasks with natural labels from users, such as recommender systems and ads click-through rate prediction; known as "exposure bias", "popularity bias", or "filter bubbles" (simulated in the sketch below)
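A toy simulation (entirely hypothetical: the item counts, appeal values, and loop length are made up) shows how equally good items diverge once exposure compounds:

```python
# A minimal simulation of a degenerate feedback loop in a recommender:
# items are ranked by historical click counts, the top item gets all the
# exposure, and its clicks feed back into the ranking.
import random

random.seed(0)

NUM_ITEMS = 5
clicks = [1] * NUM_ITEMS          # pseudo-count prior: every item starts equal
true_appeal = [0.5] * NUM_ITEMS   # all items are equally appealing to users

for step in range(10_000):
    # "Model": recommend the item with the most recorded clicks
    # (ties break toward item 0, and that tiny advantage then compounds).
    recommended = max(range(NUM_ITEMS), key=lambda i: clicks[i])
    # "User": clicks with a probability that depends only on true appeal.
    if random.random() < true_appeal[recommended]:
        clicks[recommended] += 1  # the output becomes future training input

print(clicks)  # one item dominates even though all items are equally good
```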
Monitoring & Observability
Monitoring
= the act of tracking, measuring, and logging different metrics that can help us determine when something goes wrong
- metrics
- operational metrics: the metrics that should be monitored with any software systems such as latency, throughput, and CPU utilization
- ML-specific metrics
- monitoring accuracy-related metrics
- monitoring predictions
- monitoring features
- monitoring raw inputs
- monitoring toolbox
- logs
- dashboards
- alerts
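As a concrete example of monitoring features, a minimal sketch (assuming a SciPy dependency; the feature windows and alert threshold are made-up) that compares a production feature window against its training distribution with a two-sample Kolmogorov-Smirnov test and fires an alert on drift:

```python
# Compare a serving window of one feature against a training reference
# with a two-sample KS test; alert when the distributions differ.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # reference window
serve_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted in production

statistic, p_value = ks_2samp(train_feature, serve_feature)
ALERT_THRESHOLD = 0.01  # arbitrary significance level for the alert

if p_value < ALERT_THRESHOLD:
    print(f"ALERT: possible drift (KS={statistic:.3f}, p={p_value:.2e})")
```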
Observability
= setting up our system in a way that gives us visibility into our system to help us investigate what went wrong
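A minimal sketch of what observability-friendly logging can look like (the event schema and field names are assumptions, not a standard): log enough context with every prediction that you can later slice metrics by model version, feature values, and so on.

```python
# Structured prediction logging: each prediction becomes a queryable event.
import json
import time
import uuid

def log_prediction(model_version, features, prediction):
    event = {
        "event": "prediction",
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    print(json.dumps(event))  # stand-in for a real log pipeline

log_prediction("fraud-v7", {"amount": 120.5, "country": "DE"}, 0.92)
```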
Continual Learning
Types of model updates
- model iteration
= A new feature is added to an existing model architecture, or the model architecture is changed.
- data iteration
= The model architecture and features remain the same, but you refresh the model with new data.
Stateful retraining vs. Stateless retraining
- stateful retraining (fine-tuning/incremental learning)
- the model continues training on new data
- mostly applied to data iteration
- stateless retraining
- the model is trained from scratch each time
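A minimal sketch contrasting the two retraining modes, using scikit-learn's SGDClassifier (the data here is synthetic; any model exposing `partial_fit` would do):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for the accumulated and freshly arrived data.
X_old, y_old = rng.normal(size=(1000, 4)), rng.integers(0, 2, 1000)
X_new, y_new = rng.normal(size=(200, 4)), rng.integers(0, 2, 200)

# Stateless retraining: train from scratch on all accumulated data.
stateless = SGDClassifier().fit(np.vstack([X_old, X_new]),
                                np.concatenate([y_old, y_new]))

# Stateful retraining: keep the fitted model and continue on new data only.
stateful = SGDClassifier()
stateful.partial_fit(X_old, y_old, classes=np.array([0, 1]))
stateful.partial_fit(X_new, y_new)  # fine-tune on the fresh batch only
```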
Four Stages of Continual Learning
- Stage 1 - Manual, stateless retraining
- Stage 2 - Automated retraining: requires a script that automates the whole retraining pipeline
- Stage 3 - Automated, stateful training: still runs on a fixed retraining schedule, but continues training from the last checkpoint instead of retraining from scratch
- Stage 4 - Continual learning: the model is updated automatically whenever a trigger fires (sketched after this list):
- time-based
- performance-based
- volume-based
- drift-based
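A minimal sketch of how the four trigger types might be combined (all thresholds are made-up examples):

```python
import time

def should_retrain(last_train_ts, current_auc, baseline_auc,
                   new_examples, drift_p_value):
    """Return the name of the trigger that fired, or None."""
    if time.time() - last_train_ts > 24 * 3600:  # time-based: at least daily
        return "time"
    if current_auc < baseline_auc - 0.05:        # performance-based: metric drop
        return "performance"
    if new_examples >= 100_000:                  # volume-based: enough new data
        return "volume"
    if drift_p_value < 0.01:                     # drift-based: distribution shift
        return "drift"
    return None
```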
Test in Production
There are several techniques for evaluating a model in production:
Blue/Green
- derived from software development
- shift all traffic from the old model to the new model by updating the load balancer
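A minimal sketch of the blue/green idea (in a real system the flip happens at the load balancer; here a single pointer stands in for it):

```python
# Two fully deployed model environments; only one receives traffic.
models = {"blue": lambda x: 0, "green": lambda x: 1}  # stand-in models
live = "blue"  # all traffic currently goes to blue

def predict(x):
    return models[live](x)

live = "green"  # the "cut-over": all traffic now hits green at once
```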
Shadow deployment / Challenger model
- Deploy the candidate model in parallel with the existing model.
- For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user.
- Log the predictions from the new model for analysis purposes
- Replace the existing model with the new model if the new model's predictions are satisfactory
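A minimal sketch of shadow deployment (the model callables and the logging target are illustrative):

```python
# Both models score every request; only the existing model's answer is served.
import logging

logging.basicConfig(level=logging.INFO)

def serve(request, champion, challenger):
    served = champion(request)
    shadow = challenger(request)  # computed but never shown to the user
    logging.info("shadow_log request=%s champion=%s challenger=%s",
                 request, served, shadow)  # logged for offline comparison
    return served

print(serve({"x": 1.0}, champion=lambda r: 0.3, challenger=lambda r: 0.4))
```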
A/B testing
- Deploy the candidate model alongside the existing model.
- A percentage of traffic is routed to the new model for predictions; the rest is routed to the existing model for predictions.
- Monitor and analyze the predictions and user feedback, and run statistical tests to check whether the difference in performance between the two models is significant; this typically requires long validation cycles to gather enough data
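A minimal sketch of deterministic A/B bucketing (the 10% treatment share and the hashing scheme are arbitrary choices): users are bucketed by ID so each user consistently sees the same model.

```python
import hashlib

def bucket(user_id, treatment_share=0.10):
    # Hash the user ID so assignment is stable across requests.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if (h % 100) < treatment_share * 100 else "A"

def predict(user_id, request, model_a, model_b):
    return model_b(request) if bucket(user_id) == "B" else model_a(request)

print(predict("user-42", {"x": 1.0},
              model_a=lambda r: "existing", model_b=lambda r: "candidate"))
```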
Canary release
- Deploy the candidate model alongside the existing model. The candidate model is called the canary.
- A portion of the traffic is routed to the candidate model.
- If its performance is satisfactory, increase the traffic to the candidate model. If not, abort the canary and route all the traffic back to the existing model.
- Stop when either the canary serves all the traffic (the candidate model has replaced the existing model) or when the canary is aborted.
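A minimal sketch of a canary ramp-up loop (the metric hook `canary_error_rate()` is hypothetical, as are the thresholds and ramp schedule):

```python
import random

def canary_error_rate(share):
    # Hypothetical metric source; here simulated so the demo always ramps up.
    return random.uniform(0.00, 0.02)

canary_share = 0.01  # start with 1% of traffic on the canary
while canary_share < 1.0:
    if canary_error_rate(canary_share) > 0.05:  # unacceptable regression
        canary_share = 0.0                      # abort: route all traffic back
        break
    canary_share = min(1.0, canary_share * 2)   # ramp up gradually

print("canary share:", canary_share)  # 1.0 = full replacement, 0.0 = aborted
```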
Interleaving experiments
- Instead of splitting users between models, interleave the predictions of both models in a single ranked list served to each user, and measure which model's items users engage with (e.g., click on)
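A simplified team-draft-style sketch of interleaving (the rankings are illustrative; real systems also handle position bias and per-round pick order more carefully):

```python
import random

def team_draft(ranking_a, ranking_b, k=6):
    """Blend two ranked lists; remember which model placed each item."""
    combined, credit = [], {}
    a, b = list(ranking_a), list(ranking_b)
    while len(combined) < k and (a or b):
        picker = random.choice("AB")      # coin flip decides who picks next
        source = a if picker == "A" else b
        while source and source[0] in combined:
            source.pop(0)                 # skip items already placed
        if source:                        # if exhausted, just flip again
            item = source.pop(0)
            combined.append(item)
            credit[item] = picker         # clicks on item count for this model
    return combined, credit

combined, credit = team_draft(["x", "y", "z", "w"], ["z", "x", "q", "w"])
print(combined, credit)  # a click on an item is credited to credit[item]
```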
Multi-Armed Bandits
- unlike the static approaches above, this is a dynamic approach: it routes traffic among model versions adaptively, in the spirit of reinforcement learning
- balances exploration (trying each model to estimate its payoff) and exploitation (sending most traffic to the current best model)
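A minimal epsilon-greedy sketch of bandit-based model selection (the rewards are simulated; in production they would come from user feedback such as clicks):

```python
import random

random.seed(1)
models = ["v1", "v2", "v3"]
counts = {m: 0 for m in models}
total_reward = {m: 0.0 for m in models}
true_ctr = {"v1": 0.10, "v2": 0.12, "v3": 0.08}  # unknown to the bandit
EPSILON = 0.1

def choose():
    # Explore with probability epsilon (or until every model has been tried).
    if random.random() < EPSILON or not all(counts.values()):
        return random.choice(models)
    # Exploit: pick the model with the best observed average reward.
    return max(models, key=lambda m: total_reward[m] / counts[m])

for _ in range(10_000):
    m = choose()
    reward = 1.0 if random.random() < true_ctr[m] else 0.0  # simulated click
    counts[m] += 1
    total_reward[m] += reward

print(max(models, key=lambda m: counts[m]))  # most traffic ends up on the best model
```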